 Lead


Symbolic Neural Generation with Applications to Lead Discovery in Drug Design

Srinivasan, Ashwin, Baskar, A, Dash, Tirtharaj, Bain, Michael, Dey, Sanjay Kumar, Banerjee, Mainak

arXiv.org Artificial Intelligence

We investigate a relatively underexplored class of hybrid neurosymbolic models integrating symbolic learning with neural reasoning to construct data generators meeting formal correctness criteria. In Symbolic Neural Generators (SNGs), symbolic learners examine logical specifications of feasible data from a small set of instances -- sometimes just one. Each specification in turn constrains the conditional information supplied to a neural-based generator, which rejects any instance violating the symbolic specification. Like other neurosymbolic approaches, SNG exploits the complementary strengths of symbolic and neural methods. The outcome of an SNG is a triple $(H, X, W)$, where $H$ is a symbolic description of feasible instances constructed from data, $X$ a set of generated new instances that satisfy the description, and $W$ an associated weight. We introduce a semantics for such systems, based on the construction of appropriate "base" and "fibre" partially-ordered sets combined into an overall partial order, and outline a probabilistic extension relevant to practical applications. In this extension, SNGs result from searching over a weighted partial ordering. We implement an SNG combining a restricted form of Inductive Logic Programming (ILP) with a large language model (LLM) and evaluate it on early-stage drug design. Our main interest is the description and the set of potential inhibitor molecules generated by the SNG. On benchmark problems -- where drug targets are well understood -- SNG performance is statistically comparable to state-of-the-art methods. On exploratory problems with poorly understood targets, generated molecules exhibit binding affinities on par with leading clinical candidates. Experts further find the symbolic specifications useful as preliminary filters, with several generated molecules identified as viable for synthesis and wet-lab testing.
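The generate-then-reject step described in this abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the specification is a plain Python predicate rather than a learned logic program, the proposer is a random toy rather than an LLM, the feature names are hypothetical, and the weight is simply the acceptance rate, which stands in for $W$ and is not the paper's definition.

```python
import random
from typing import Callable, Dict, List, Tuple

Instance = Dict[str, float]


def generate_with_spec(
    spec: Callable[[Instance], bool],
    propose: Callable[[random.Random], Instance],
    n_proposals: int,
    seed: int = 0,
) -> Tuple[List[Instance], float]:
    """Propose instances and keep only those satisfying the specification.

    Returns the accepted instances and the acceptance rate. The acceptance
    rate is an illustrative stand-in for a weight, not the paper's W.
    """
    rng = random.Random(seed)
    proposals = (propose(rng) for _ in range(n_proposals))
    accepted = [x for x in proposals if spec(x)]
    weight = len(accepted) / n_proposals if n_proposals else 0.0
    return accepted, weight


# Hypothetical specification H: two simple property bounds (assumed, not from the paper).
def toy_spec(x: Instance) -> bool:
    return x["mol_weight"] <= 500.0 and x["h_donors"] <= 5


# Hypothetical proposer standing in for the neural generator.
def toy_propose(rng: random.Random) -> Instance:
    return {
        "mol_weight": rng.uniform(100.0, 900.0),
        "h_donors": float(rng.randint(0, 10)),
    }


X, W = generate_with_spec(toy_spec, toy_propose, n_proposals=200)
```

Every element of `X` satisfies `toy_spec` by construction, which is the property the abstract asks of generated instances.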



A Collaborative Framework Integrating Large Language Model and Chemical Fragment Space: Mutual Inspiration for Lead Design

Tuo, Hao, Li, Yan, Hu, Xuanning, Zhao, Haishi, Liu, Xueyan, Yang, Bo

arXiv.org Artificial Intelligence

Drug design, particularly the discovery of lead compounds, is of core strategic importance to combating disease and enhancing human well-being. Prevailing computational methods, however, struggle to integrate domain-specific knowledge effectively, which severely limits their capacity to identify novel lead compounds with validated binding modes and new scaffolds. Here, we propose AutoLeadDesign, a lead compound design framework that couples the extensive domain knowledge encoded in large language models with chemical fragments to explore vast chemical space progressively and efficiently. Comprehensive experiments indicate that AutoLeadDesign outperforms baseline methods. Notably, empirical lead design campaigns against two clinically relevant targets (PRMT5 and SARS-CoV-2 PLpro) demonstrate AutoLeadDesign's competence in de novo generation of lead compounds, achieving expert-competitive design efficacy. Structural analysis further confirms their mechanism-validated inhibitory patterns. By tracing the design process, we find that AutoLeadDesign shares analogous mechanisms with fragment-based drug design, which traditionally relies on expert decision-making, further revealing why it works. Overall, AutoLeadDesign offers an efficient approach for lead compound design, suggesting its potential utility in drug design.


DrugGen: Advancing Drug Discovery with Large Language Models and Reinforcement Learning Feedback

Sheikholeslami, Mahsa, Mazrouei, Navid, Gheisari, Yousof, Fasihi, Afshin, Irajpour, Matin, Motahharynia, Ali

arXiv.org Artificial Intelligence

Traditional drug design faces significant challenges due to inherent chemical and biological complexities, often resulting in high failure rates in clinical trials. Deep learning advancements, particularly generative models, offer potential solutions to these challenges. One promising algorithm is DrugGPT, a transformer-based model that generates small molecules for input protein sequences. Although promising, it generates both chemically valid and invalid structures and does not incorporate the features of approved drugs, resulting in time-consuming and inefficient drug discovery. To address these issues, we introduce DrugGen, an enhanced model based on the DrugGPT structure. DrugGen is fine-tuned on approved drug-target interactions and optimized with proximal policy optimization. By receiving reward feedback from protein-ligand binding affinity prediction using pre-trained transformers (PLAPT) and a customized invalid structure assessor, DrugGen significantly improves performance. Evaluation across multiple targets demonstrated that DrugGen achieves 100% valid structure generation compared to 95.5% with DrugGPT and produces molecules with higher predicted binding affinities (7.22 [6.30-8.07]) compared to DrugGPT (5.81 [4.97-6.63]) while maintaining diversity and novelty. Docking simulations further validate its ability to generate molecules targeting binding sites effectively. For example, in the case of fatty acid-binding protein 5 (FABP5), DrugGen generated molecules with superior docking scores (FABP5/11, -9.537 and FABP5/5, -8.399) compared to the reference molecule (palmitic acid, -6.177). Beyond lead compound generation, DrugGen also shows potential for drug repositioning and creating novel pharmacophores for existing targets. By producing high-quality small molecules, DrugGen provides a high-performance medium for advancing pharmaceutical research and drug discovery.


Y-Mol: A Multiscale Biomedical Knowledge-Guided Large Language Model for Drug Development

Ma, Tengfei, Lin, Xuan, Li, Tianle, Li, Chaoyi, Chen, Long, Zhou, Peng, Cai, Xibao, Yang, Xinyu, Zeng, Daojian, Cao, Dongsheng, Zeng, Xiangxiang

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have recently demonstrated remarkable performance in general tasks across various fields. However, their effectiveness within specific domains such as drug development remains a challenge. To address this, we introduce Y-Mol, which establishes an LLM paradigm for the drug development workflow. Y-Mol is a multiscale biomedical knowledge-guided LLM designed to accomplish tasks across lead compound discovery, preclinical prediction, and clinical prediction. By integrating millions of items of multiscale biomedical knowledge and using LLaMA2 as the base LLM, Y-Mol augments the reasoning capability in the biomedical domain by learning from a corpus of publications, knowledge graphs, and expert-designed synthetic data. The capability is further enriched with three types of drug-oriented instructions: description-based prompts from processed publications, semantic-based prompts for extracting associations from knowledge graphs, and template-based prompts for understanding expert knowledge from biomedical tools. In addition, Y-Mol offers a set of LLM paradigms that can autonomously execute downstream tasks across the entire process of drug development, including virtual screening, drug design, pharmacological property prediction, and drug-related interaction prediction. Our extensive evaluations on various biomedical sources demonstrate that Y-Mol significantly outperforms general-purpose LLMs in discovering lead compounds, predicting molecular properties, and identifying drug interaction events.


Hit and Lead Discovery with Explorative RL and Fragment-based Molecule Generation

Neural Information Processing Systems

Recently, utilizing reinforcement learning (RL) to generate molecules with desired properties has been highlighted as a promising strategy for drug design. A molecular docking program -- a physical simulation that estimates protein-small molecule binding affinity -- can be an ideal reward scoring function for RL, as it is a straightforward proxy of the therapeutic potential. Still, two pressing challenges exist for this task. First, the models often fail to generate chemically realistic and pharmacochemically acceptable molecules. Second, docking score optimization is a difficult exploration problem that involves many local optima and a score surface that is not smooth with respect to molecular structure.


Accelerating Drug Safety Assessment using Bidirectional-LSTM for SMILES Data

Rao, K. Venkateswara, Rao, Kunjam Nageswara, Ratnam, G. Sita

arXiv.org Artificial Intelligence

Computational methods are useful in accelerating the pace of drug discovery. Drug discovery involves several steps, such as target identification and validation, lead discovery, and lead optimisation. In the lead optimisation phase, the absorption, distribution, metabolism, excretion, and toxicity properties of lead compounds are assessed. This work addresses the prediction of toxicity and solubility for lead compounds represented in Simplified Molecular Input Line Entry System (SMILES) notation. Among the different approaches that work on SMILES data, the proposed model was built using a sequence-based approach. The proposed Bi-Directional Long Short Term Memory (BiLSTM) model is a variant of the Recurrent Neural Network (RNN) that processes input molecular sequences in both forward and backward directions for a comprehensive examination of the structural features of molecules. The proposed work aims to understand the sequential patterns encoded in the SMILES strings, which are then utilised for predicting the toxicity of the molecules. On the ClinTox dataset, the proposed model surpasses previous approaches such as TrimNet and pre-trained graph neural networks (GNNs), achieving a ROC accuracy of 0.96. On the FreeSolv dataset, BiLSTM outperforms the previous model with a low RMSE value of 1.22 in solubility prediction.
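As a rough illustration of the sequence-based input described in this abstract, the sketch below tokenises SMILES strings and produces the forward and reversed index sequences that the two directions of a bidirectional recurrent model would read. It is a minimal sketch under stated assumptions: the tokenisation rule (bracket atoms, `Cl`, `Br`, otherwise single characters), the `<pad>` entry, and all function names are illustrative and are not taken from the paper. Real BiLSTM layers normally perform the reversal internally.

```python
import re
from typing import Dict, List, Tuple

# Assumed tokenisation rule: bracket atoms such as [NH4+], the two-letter
# halogens Cl and Br, then any remaining single character.
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens."""
    return _TOKEN.findall(smiles)


def build_vocab(corpus: List[str]) -> Dict[str, int]:
    """Assign an integer id to each token, in order of first appearance."""
    vocab: Dict[str, int] = {"<pad>": 0}
    for smiles in corpus:
        for token in tokenize(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_bidirectional(
    smiles: str, vocab: Dict[str, int]
) -> Tuple[List[int], List[int]]:
    """Return the token ids in forward order and in reversed order."""
    ids = [vocab[token] for token in tokenize(smiles)]
    return ids, ids[::-1]


vocab = build_vocab(["CCO", "CCl"])
forward_ids, backward_ids = encode_bidirectional("CCO", vocab)
```

Treating `Cl` and `Br` as single tokens keeps a chlorine atom from being read as a carbon followed by a stray character, which is the main reason not to split SMILES purely character by character.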


AI revolutionized the battlefield in 2023 as Israel, China lead development amid tech arms race

FOX News

The mainstream attention on artificial intelligence (AI) in 2023 allowed militaries to more openly discuss some of the astonishing initiatives they've undertaken as they race toward the future of warfare. AI presented an entirely different challenge and revealed an arms race many did not even know had already gotten well underway: Advanced and automated targeting capabilities, virtual environment weapon testing and AI-controlled vehicles present just the tip of a substantial and rapidly developing iceberg. The allure of AI is so strong that the Pentagon has some 800 AI-related unclassified projects in the works to attain a "force multiplier" integration and gain the upper hand over its rivals. This year gave the general public a better idea of where militaries stand with their astonishing development and where they might head next.


An open unified deep graph learning framework for discovering drug leads

Yin, Yueming, Hu, Haifeng, Yang, Zhen, Yang, Jitao, Ye, Chun, Wu, Jiansheng, Goh, Wilson Wen Bin

arXiv.org Artificial Intelligence

Computational discovery of ideal lead compounds is a critical process for modern drug discovery. It comprises multiple stages: hit screening, molecular property prediction, and molecule optimization. Current efforts are disparate, involving the establishment of models for each stage, followed by multi-stage multi-model integration. However, this is non-ideal, as clumsy integration of incompatible models increases research overheads, and may even reduce success rates in drug discovery. Facilitating compatibilities requires establishing inherent model consistencies across lead discovery stages. Towards that effect, we propose an open deep graph learning (DGL) based pipeline: generative adversarial feature subspace enhancement (GAFSE), which first unifies the modeling of these stages into one learning framework. GAFSE also offers standardized modular design and streamlined interfaces for future expansions and community support. GAFSE combines adversarial/generative learning, graph attention network, graph reconstruction network, and optimizes the classification/regression loss, adversarial/generative loss, and reconstruction loss simultaneously. Convergence analysis theoretically guarantees model generalization performance. Exhaustive benchmarking demonstrates that the GAFSE pipeline achieves excellent performance across almost all lead discovery stages, while also providing valuable model interpretability. Hence, we believe this tool will enhance the efficiency and productivity of drug discovery researchers.


Lead Business Intelligence Analyst

#artificialintelligence

Want to join a company where doing good is what we do? The right data can make that happen. As a member of the Enterprise Data Solutions team, you'll have the opportunity to work with data assets and data professionals across the entire organization to improve efficiencies, increase profitability and better manage risk. You'll also perform analysis and develop data solutions that solve real business problems rather than offer shortsighted quick fixes. Partnering with your teammates to understand project requirements and user insights, you'll continually focus on supporting Amica's strategic plan to create peace of mind and build enduring relationships.